Problem 8

This question involves the use of simple linear regression on the Auto data set.

library(data.table)
library(tidyverse)
auto <- fread("../Assignment 1/Auto.csv")
auto$horsepower <- as.numeric(auto$horsepower)
Warning: NAs introduced by coercion

8a

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:

lm <- lm(mpg ~ horsepower, data = auto)
lm

Call:
lm(formula = mpg ~ horsepower, data = auto)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578  
summary(lm)

Call:
lm(formula = mpg ~ horsepower, data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

8a i

Is there a relationship between the predictor and the response?

Yes. After horsepower is converted to numeric, its coefficient is highly significant (p < 2e-16), indicating a relationship between horsepower and mpg.

8a ii

How strong is the relationship between the predictor and the response?

Fairly strong. The p-value is below \(2\times10^{-16}\), and \(R^2 = 0.6059\) means horsepower alone explains about 61% of the variance in mpg.

8a iii

Is the relationship between the predictor and the response positive or negative?

Negative. The slope is \(-0.1578\): each additional unit of horsepower is associated with a drop of about 0.16 mpg.

8a iv

What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

predict(lm, data.frame(horsepower = 98))
# confidence interval
predict(lm, data.frame(horsepower = 98), interval = "confidence")
# prediction interval
predict(lm, data.frame(horsepower = 98), interval = "prediction")

The predicted mpg at a horsepower of 98 is 39.9359 − 0.1578 × 98 ≈ 24.47, with a 95% confidence interval of roughly (23.97, 24.96) and a 95% prediction interval of roughly (14.81, 34.12). (Indexing predict(lm, auto)[98] would instead return the fitted value for the 98th observation.) The prediction interval is much wider because it covers the variability of a single new observation, not just the uncertainty in the estimated mean.

8b

Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(auto$horsepower, auto$mpg)
abline(lm)

8c

Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

plot(lm)

From the textbook, Figure 3.9: “Plots of residuals versus predicted (or fitted) values for the Auto data set. In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend. Left: A linear regression of mpg on horsepower. A strong pattern in the residuals indicates non-linearity in the data. Right: A linear regression of mpg on horsepower and horsepower². There is little pattern in the residuals.” The residuals-vs-fitted plot for this fit shows the same U-shaped pattern as the left panel, so the main problem is unmodeled non-linearity.

9

This question involves the use of multiple linear regression on the Auto data set.

9a

Produce a scatterplot matrix which includes all of the variables in the data set.

auto$name <- as.factor(auto$name)
pairs(auto)

9b

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(auto[, -9]) # NAs in horsepower propagate into the matrix; use = "complete.obs" would drop them
                    mpg  cylinders displacement horsepower
mpg           1.0000000 -0.7762599   -0.8044430         NA
cylinders    -0.7762599  1.0000000    0.9509199         NA
displacement -0.8044430  0.9509199    1.0000000         NA
horsepower           NA         NA           NA          1
weight       -0.8317389  0.8970169    0.9331044         NA
acceleration  0.4222974 -0.5040606   -0.5441618         NA
year          0.5814695 -0.3467172   -0.3698041         NA
origin        0.5636979 -0.5649716   -0.6106643         NA
                 weight acceleration       year     origin
mpg          -0.8317389    0.4222974  0.5814695  0.5636979
cylinders     0.8970169   -0.5040606 -0.3467172 -0.5649716
displacement  0.9331044   -0.5441618 -0.3698041 -0.6106643
horsepower           NA           NA         NA         NA
weight        1.0000000   -0.4195023 -0.3079004 -0.5812652
acceleration -0.4195023    1.0000000  0.2829009  0.2100836
year         -0.3079004    0.2829009  1.0000000  0.1843141
origin       -0.5812652    0.2100836  0.1843141  1.0000000

9c

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

lm <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, auto)
lm

Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
    acceleration + year + origin, data = auto)

Coefficients:
 (Intercept)     cylinders  displacement    horsepower  
  -17.218435     -0.493376      0.019896     -0.016951  
      weight  acceleration          year        origin  
   -0.006474      0.080576      0.750773      1.426140  
summary(lm)

Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
    acceleration + year + origin, data = auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

9c i

Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4, with a p-value below 2.2e-16, indicates that at least one of the predictors is related to mpg.

9c ii

Which predictors appear to have a statistically significant relationship to the response?

Judging by the individual p-values, displacement, weight, year, and origin are statistically significant; cylinders, horsepower, and acceleration are not.

9c iii

What does the coefficient for the year variable suggest?

The coefficient of about 0.75 suggests that, holding the other predictors fixed, mpg rises by roughly 0.75 per model year. (Cars become more efficient over time.)

9d

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

plot(lm)

The residual plot suggests a fair number of large residuals (outliers), and the leverage plot flags a few observations with unusually high leverage.

9e

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

summary(lm(mpg ~ cylinders * displacement + horsepower:weight, auto))

Call:
lm(formula = mpg ~ cylinders * displacement + horsepower:weight, 
    data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.8087  -2.2841  -0.5424   2.1259  17.5877 

Coefficients:
                         Estimate Std. Error t value
(Intercept)             5.524e+01  2.384e+00  23.167
cylinders              -3.831e+00  5.339e-01  -7.176
displacement           -1.388e-01  1.511e-02  -9.184
cylinders:displacement  1.923e-02  2.171e-03   8.857
horsepower:weight      -2.231e-05  2.961e-06  -7.533
                       Pr(>|t|)    
(Intercept)             < 2e-16 ***
cylinders              3.68e-12 ***
displacement            < 2e-16 ***
cylinders:displacement  < 2e-16 ***
horsepower:weight      3.54e-13 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.164 on 387 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.7182,    Adjusted R-squared:  0.7153 
F-statistic: 246.6 on 4 and 387 DF,  p-value: < 2.2e-16

My guess turned out to be correct: both cylinders:displacement and horsepower:weight are significant, with p-values below 2e-16.

9f

Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

summary(lm(log(mpg) ~ cylinders^2, auto))

Call:
lm(formula = log(mpg) ~ cylinders^2, data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.61694 -0.12149 -0.00995  0.12358  0.62573 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.002754   0.032379  123.62   <2e-16 ***
cylinders   -0.165149   0.005664  -29.16   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1918 on 395 degrees of freedom
Multiple R-squared:  0.6828,    Adjusted R-squared:  0.682 
F-statistic: 850.3 on 1 and 395 DF,  p-value: < 2.2e-16
summary(lm(log(mpg) ~ cylinders^2 + log(horsepower)^2:sqrt(weight), auto))

Call:
lm(formula = log(mpg) ~ cylinders^2 + log(horsepower)^2:sqrt(weight), 
    data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47818 -0.10004 -0.00615  0.10030  0.51869 

Coefficients:
                               Estimate Std. Error t value
(Intercept)                   4.4881118  0.0421172 106.563
cylinders                    -0.0211598  0.0107477  -1.969
log(horsepower):sqrt(weight) -0.0050924  0.0003451 -14.756
                             Pr(>|t|)    
(Intercept)                    <2e-16 ***
cylinders                      0.0497 *  
log(horsepower):sqrt(weight)   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1535 on 389 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.7972,    Adjusted R-squared:  0.7961 
F-statistic: 764.4 on 2 and 389 DF,  p-value: < 2.2e-16

Note that ^ inside a formula is a formula operator, so cylinders^2 is fit as plain cylinders and log(horsepower)^2 as log(horsepower); a genuine quadratic term would need I(), as in I(cylinders^2). Even so, I’m a little surprised that the regressions held up as well as they did under these basic transformations.

10

This question should be answered using the Carseats data set.

carseats <- ISLR2::Carseats

10a

Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm <- lm(Sales ~ Price + Urban + US, carseats)
lm

Call:
lm(formula = Sales ~ Price + Urban + US, data = carseats)

Coefficients:
(Intercept)        Price     UrbanYes        USYes  
   13.04347     -0.05446     -0.02192      1.20057  
summary(lm)

Call:
lm(formula = Sales ~ Price + Urban + US, data = carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10b

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

As Price increases by 1, Sales decrease by \(0.054459\). When Urban is Yes, Sales decrease by \(0.021916\), though this effect is not significant. When US is Yes, Sales increase by \(1.200573\). (Sales is measured in thousands of units, so the US effect is about 1,200 units.)

10c

Write out the model in equation form, being careful to handle the qualitative variables properly.

\(\textrm{Sales}=13.04-0.05\,\textrm{Price}-0.02\,\textrm{Urban}_{\textrm{Yes}}+1.20\,\textrm{US}_{\textrm{Yes}}\)

10d

For which of the predictors can you reject the null hypothesis \(H_0:\beta_j=0\)?

We can reject \(H_0\) for Price and US: their p-values are far below 0.05. We cannot reject it for Urban (p = 0.936).

10e

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm <- lm(Sales ~ Price + US, carseats)
lm

Call:
lm(formula = Sales ~ Price + US, data = carseats)

Coefficients:
(Intercept)        Price        USYes  
   13.03079     -0.05448      1.19964  
summary(lm)

Call:
lm(formula = Sales ~ Price + US, data = carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10f

How well do the models in (a) and (e) fit the data?

About equally well: both models have \(R^2 = 0.2393\). The model in (e) has a marginally higher adjusted \(R^2\) (0.2354 vs. 0.2335) since it drops the insignificant Urban term, but neither explains more than about 24% of the variance in Sales.

10g

Using the model from (e), obtain \(95\%\) confidence intervals for the coefficient(s).

confint(lm)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

10h

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(lm)

There are a few points that look to be outside the normal range. Points 51, 69, and 377 seem to be outliers, and points 26, 50, and 368 seem to have high leverage. I’m not especially worried, since there are only a few of each in a good-sized dataset.

4

From the data available at http://www.stat.ufl.edu/~winner/datasets.html, obtain the data on “Fibre Diameters and Breaking Strenghs for Nextel 610 Fibres.” (please note that there is a typo on the website. It should be Strength and not Strengh) According to the description available there, the expectation is that the log of breaking strength of the fibre should be negatively and linearly related to diameter. (Note log here means natural log if not specified.)

dt <- fread("fibre.csv")
colnames(dt) <- c("stren", "diam")
head(dt)

4a

Produce a scatter plot of breaking strength against diameter.

plot(dt$diam, dt$stren)

4b

Produce a scatter plot of the log of breaking strength against diameter.

dt$logStren <- log(dt$stren)
plot(dt$diam, dt$logStren)

4c

Produce a scatter plot of the log of breaking strength against the log of diameter.

dt$logDiam <- log(dt$diam)
plot(dt$logDiam, dt$logStren)

4d

Regress breaking strength on diameter.

lm <- lm(stren ~ diam, data = dt)
lm

Call:
lm(formula = stren ~ diam, data = dt)

Coefficients:
(Intercept)         diam  
     5312.8       -197.7  
summary(lm)

Call:
lm(formula = stren ~ diam, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-603.46 -246.10   -6.76  223.64 1047.22 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5312.78    1040.41   5.106 5.61e-06 ***
diam         -197.73      91.91  -2.151   0.0365 *  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 335.7 on 48 degrees of freedom
Multiple R-squared:  0.08794,   Adjusted R-squared:  0.06894 
F-statistic: 4.628 on 1 and 48 DF,  p-value: 0.03651

4e

Regress the log of breaking strength on diameter.

lm <- lm(logStren ~ diam, data = dt)
lm

Call:
lm(formula = logStren ~ diam, data = dt)

Coefficients:
(Intercept)         diam  
    8.76091     -0.06504  
summary(lm)

Call:
lm(formula = logStren ~ diam, data = dt)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.211992 -0.075119  0.003561  0.076367  0.293763 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.76091    0.33575  26.094   <2e-16 ***
diam        -0.06504    0.02966  -2.193   0.0332 *  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1083 on 48 degrees of freedom
Multiple R-squared:  0.09105,   Adjusted R-squared:  0.07211 
F-statistic: 4.808 on 1 and 48 DF,  p-value: 0.0332

4f

Regress the log of breaking strength on the log of diameter.

lm <- lm(logStren ~ logDiam, data = dt)
lm

Call:
lm(formula = logStren ~ logDiam, data = dt)

Coefficients:
(Intercept)      logDiam  
     9.8328      -0.7454  
summary(lm)

Call:
lm(formula = logStren ~ logDiam, data = dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.21123 -0.07610  0.00396  0.07673  0.29398 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.8328     0.8209  11.978 4.99e-16 ***
logDiam      -0.7454     0.3385  -2.202   0.0325 *  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1083 on 48 degrees of freedom
Multiple R-squared:  0.09174,   Adjusted R-squared:  0.07282 
F-statistic: 4.848 on 1 and 48 DF,  p-value: 0.03251

5

From the data available at http://www.stat.ufl.edu/~winner/datasets.html, obtain the data on “Variables associated with Permeability and Porosity of Rocks”

dt <- fread("rocks.csv")
colnames(dt) <- c("type", "density", "porosity", "logPermeability", "residue", "carbonate", "aGrain", "aSD", "bGrain", "bSD", "calcite", "dolomite")
head(dt)

5a

Fit a multiple regression model to predict porosity. Please provide a clean model with only significant variables. Interpret your model results and diagnostics.

lm <- lm(porosity ~ ., data = dt)
lmSumm <- summary(lm)
coefs <- coef(lmSumm)[-1,4] <= .05

# This prunes the regression until only significant variables remain
while (length(coefs) != sum(coefs)) {
  # drop every predictor with p > .05, rebuild the formula, and refit
  form <- paste("porosity ~", paste(names(coefs)[coefs], collapse = " + "))
  lm <- lm(formula = form, data = dt)
  lmSumm <- summary(lm)
  coefs <- coef(lmSumm)[-1,4] <= .05
}
lm

Call:
lm(formula = form, data = dt)

Coefficients:
    (Intercept)  logPermeability  
          6.240            2.666  
lmSumm

Call:
lm(formula = form, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1522 -1.0096 -0.2271  0.3610  5.8290 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)        6.240      2.046   3.050  0.00485 **
logPermeability    2.666      1.222   2.181  0.03740 * 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.689 on 29 degrees of freedom
Multiple R-squared:  0.141, Adjusted R-squared:  0.1113 
F-statistic: 4.759 on 1 and 29 DF,  p-value: 0.0374
par(mfrow = c(2,2))
plot(lm)

5b

Fit a multiple regression model to predict log(permeability). Please provide a clean model with only significant variables. Interpret your model results and diagnostics.

lm <- lm(logPermeability ~ ., data = dt)
lmSumm <- summary(lm)
coefs <- coef(lmSumm)[-1,4] <= .05

# This prunes the regression until only significant variables remain
while (length(coefs) != sum(coefs)) {
  # drop every predictor with p > .05, rebuild the formula, and refit
  form <- paste("logPermeability ~", paste(names(coefs)[coefs], collapse = " + "))
  lm <- lm(formula = form, data = dt)
  lmSumm <- summary(lm)
  coefs <- coef(lmSumm)[-1,4] <= .05
}
lm

Call:
lm(formula = form, data = dt)

Coefficients:
(Intercept)     porosity       aGrain       bGrain  
  -0.448085     0.063658     0.897464    -0.994242  
    calcite  
   0.003091  
lmSumm

Call:
lm(formula = form, data = dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.58844 -0.06918  0.02635  0.09975  0.40127 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.448085   0.499277  -0.897  0.37771   
porosity     0.063658   0.023650   2.692  0.01227 * 
aGrain       0.897464   0.430584   2.084  0.04710 * 
bGrain      -0.994242   0.431684  -2.303  0.02952 * 
calcite      0.003091   0.001066   2.899  0.00751 **
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2019 on 26 degrees of freedom
Multiple R-squared:  0.4452,    Adjusted R-squared:  0.3598 
F-statistic: 5.216 on 4 and 26 DF,  p-value: 0.003204
par(mfrow = c(2,2))
plot(lm)

---
title: "Assignment 2"
author: "Gus Lipkin ~ glipkin6737@floridapoly.edu"
output:
  html_notebook: default
  pdf_document: default
---

# Problem 8
> This question involves the use of simple linear regression on the Auto data set.

```{r}
library(data.table)
library(tidyverse)
auto <- fread("../Assignment 1/Auto.csv")
auto$horsepower <- as.numeric(auto$horsepower)
```
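
The coercion warning above comes from non-numeric strings in the horsepower column (in the ISLR distribution of `Auto.csv`, missing values are recorded as `"?"` — an assumption about this particular file). A minimal sketch of what `as.numeric()` does with such strings:

```{r}
# non-numeric strings such as "?" become NA under as.numeric()
x <- c("130", "?", "95")
x_num <- suppressWarnings(as.numeric(x))
print(x_num)
print(sum(is.na(x_num)))  # number of values lost to coercion
```

Counting the NAs afterwards is a quick sanity check that only the genuinely missing entries were lost.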


## 8a
> Use the `lm()` function to perform a simple linear regression with `mpg` as the response and `horsepower` as the predictor. Use the `summary()` function to print the results. Comment on the output. For example:

```{r}
lm <- lm(mpg ~ horsepower, data = auto)
lm
summary(lm)
```


### 8a i
> Is there a relationship between the predictor and the response?

Yes. After `horsepower` is converted to numeric, its coefficient is highly significant (p < 2e-16), indicating a relationship between `horsepower` and `mpg`.

### 8a ii
> How strong is the relationship between the predictor and the response?

Fairly strong. The p-value is below $2\times10^{-16}$, and $R^2 = 0.6059$ means `horsepower` alone explains about 61% of the variance in `mpg`.

### 8a iii
> Is the relationship between the predictor and the response positive or negative?

Negative. The slope is $-0.1578$: each additional unit of `horsepower` is associated with a drop of about 0.16 `mpg`.

### 8a iv
> What is the predicted `mpg` associated with a `horsepower` of 98? What are the associated 95% confidence and prediction intervals?

```{r}
# use newdata so the prediction is for horsepower = 98;
# predict(lm, auto)[98] would give the fit for the 98th row instead
predict(lm, data.frame(horsepower = 98))
# confidence interval
predict(lm, data.frame(horsepower = 98), interval = "confidence")
# prediction interval
predict(lm, data.frame(horsepower = 98), interval = "prediction")
```
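
The gap between the two interval types can be illustrated on synthetic data (a sketch with made-up `x` and `y`): both intervals are centered on the same fitted value, but the prediction interval is wider because it also covers the noise in a single new observation, not just the uncertainty in the estimated mean.

```{r}
set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 2 + 3 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
new <- data.frame(x = 25)

ci <- predict(fit, new, interval = "confidence")  # uncertainty in the mean response
pi <- predict(fit, new, interval = "prediction")  # uncertainty in one new observation
print(ci)
print(pi)
```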


## 8b
> Plot the response and the predictor. Use the `abline()` function to display the least squares regression line.

```{r}
plot(auto$horsepower, auto$mpg)
abline(lm)
```


## 8c
> Use the `plot()` function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

```{r}
plot(lm)
```

From the textbook, Figure 3.9: "*Plots of residuals versus predicted (or fitted) values for the Auto data set. In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend. Left: A linear regression of `mpg` on `horsepower`. A strong pattern in the residuals indicates non-linearity in the data. Right: A linear regression of `mpg` on `horsepower` and `horsepower^2`. There is little pattern in the residuals.*" The residuals-vs-fitted plot for this fit shows the same U-shaped pattern as the left panel, so the main problem is unmodeled non-linearity.

# 9
> This question involves the use of multiple linear regression on the `Auto` data set.

## 9a
> Produce a scatterplot matrix which includes all of the variables in the data set.

```{r}
auto$name <- as.factor(auto$name)
pairs(auto)
```

## 9b
> Compute the matrix of correlations between the variables using the function `cor()`. You will need to exclude the `name` variable, which is qualitative.

```{r}
cor(auto[, -9]) # NAs in horsepower propagate into the matrix; use = "complete.obs" would drop them
```
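
The NA rows in the correlation matrix come from the missing `horsepower` values; `cor()` propagates NAs unless told otherwise. A small sketch of the `use` argument on toy data:

```{r}
m <- cbind(a = c(1, 2, 3, 4),
           b = c(2, 4, NA, 8))
print(cor(m))                        # the NA in b propagates into the matrix
print(cor(m, use = "complete.obs"))  # drops the incomplete row first
```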

## 9c
> Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. Use the `summary()` function to print the results. Comment on the output. For instance:

```{r}
lm <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, auto)
lm
summary(lm)
```


### 9c i
> Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4, with a p-value below 2.2e-16, indicates that at least one of the predictors is related to `mpg`.

### 9c ii
> Which predictors appear to have a statistically significant relationship to the response?

Judging by the individual p-values, `displacement`, `weight`, `year`, and `origin` are statistically significant; `cylinders`, `horsepower`, and `acceleration` are not.

### 9c iii
> What does the coefficient for the `year` variable suggest?

The coefficient of about 0.75 suggests that, holding the other predictors fixed, `mpg` rises by roughly 0.75 per model year. (Cars become more efficient over time.)

## 9d
> Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

```{r}
plot(lm)
```

The residual plot suggests a fair number of large residuals (outliers), and the leverage plot flags a few observations with unusually high leverage.

## 9e
> Use the `*` and `:` symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

```{r}
summary(lm(mpg ~ cylinders * displacement + horsepower:weight, auto))
```
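
As a reminder of what the two operators do, `a * b` expands to `a + b + a:b`, while `a:b` alone adds only the product term; this can be checked on toy data (made-up `a`, `b`, `y`):

```{r}
set.seed(2)
d <- data.frame(a = rnorm(20), b = rnorm(20))
d$y <- d$a + d$b + rnorm(20)

star  <- lm(y ~ a * b, data = d)  # main effects plus interaction
colon <- lm(y ~ a:b, data = d)    # interaction term only

print(names(coef(star)))
print(names(coef(colon)))
```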

My guess turned out to be correct: both `cylinders:displacement` and `horsepower:weight` are significant, with p-values below 2e-16.

## 9f
> Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.

```{r}
summary(lm(log(mpg) ~ cylinders^2, auto))
summary(lm(log(mpg) ~ cylinders^2 + log(horsepower)^2:sqrt(weight), auto))
```
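
One caveat worth checking: inside a formula, `^` is a formula operator rather than arithmetic, so `x^2` collapses to plain `x`; `I()` is needed for a genuine quadratic term. A sketch on made-up data:

```{r}
set.seed(42)
d <- data.frame(x = runif(100, 1, 10))
d$y <- 1 + 2 * d$x + 0.5 * d$x^2 + rnorm(100)

f1 <- lm(y ~ x^2, data = d)        # ^ is formula crossing: same model as y ~ x
f2 <- lm(y ~ x + I(x^2), data = d) # I() protects the arithmetic: true quadratic
print(names(coef(f1)))
print(names(coef(f2)))
```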

Note that `^` inside a formula is a formula operator, so `cylinders^2` is fit as plain `cylinders` and `log(horsepower)^2` as `log(horsepower)`; a genuine quadratic term would need `I()`, as in `I(cylinders^2)`. Even so, I'm a little surprised that the regressions held up as well as they did under these basic transformations.

# 10
> This question should be answered using the `Carseats` data set.

```{r}
carseats <- ISLR2::Carseats
```


## 10a
> Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`.

```{r}
lm <- lm(Sales ~ Price + Urban + US, carseats)
lm
summary(lm)
```

## 10b
> Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

As `Price` increases by 1, `Sales` decrease by $0.054459$.
When `Urban` is `Yes`, `Sales` decrease by $0.021916$, though this effect is not significant.
When `US` is `Yes`, `Sales` increase by $1.200573$. (`Sales` is measured in thousands of units, so the `US` effect is about 1,200 units.)

## 10c
> Write out the model in equation form, being careful to handle the qualitative variables properly.

$\textrm{Sales}=13.04-0.05\,\textrm{Price}-0.02\,\textrm{Urban}_{\textrm{Yes}}+1.20\,\textrm{US}_{\textrm{Yes}}$
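
The `Yes` subscripts correspond to the 0/1 indicator columns R builds for factors; `model.matrix()` shows the coding directly (toy data):

```{r}
d <- data.frame(US = factor(c("No", "Yes", "Yes")),
                Urban = factor(c("Yes", "No", "Yes")))
# the first factor level ("No") is the baseline; USYes / UrbanYes are 0/1 dummies
mm <- model.matrix(~ US + Urban, d)
print(mm)
```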

## 10d
> For which of the predictors can you reject the null hypothesis $H_0:\beta_j=0$?

We can reject $H_0$ for `Price` and `US`: their p-values are far below 0.05. We cannot reject it for `Urban` (p = 0.936).

## 10e
> On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

```{r}
lm <- lm(Sales ~ Price + US, carseats)
lm
summary(lm)
```

## 10f
> How well do the models in (a) and (e) fit the data?

About equally well: both models have $R^2 = 0.2393$. The model in (e) has a marginally higher adjusted $R^2$ (0.2354 vs. 0.2335) since it drops the insignificant `Urban` term, but neither explains more than about 24% of the variance in `Sales`.

## 10g
> Using the model from (e), obtain $95\%$ confidence intervals for the coefficient(s).

```{r}
confint(lm)
```

## 10h
> Is there evidence of outliers or high leverage observations in the model from (e)?

```{r}
par(mfrow = c(2,2))
plot(lm)
```

There are a few points that look to be outside the normal range. Points 51, 69, and 377 seem to be outliers, and points 26, 50, and 368 seem to have high leverage. I'm not especially worried, since there are only a few of each in a good-sized dataset.
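
The same kinds of points can be flagged numerically rather than read off the plots; a sketch on synthetic data with one planted high-leverage outlier (common rules of thumb: hat value above $2p/n$, studentized residual beyond $\pm 3$):

```{r}
set.seed(3)
x <- c(rnorm(30), 10)               # last point sits far out in x: high leverage
y <- c(2 * x[1:30] + rnorm(30), 0)  # and far off the trend line: an outlier
fit <- lm(y ~ x)

n <- length(x); p <- 2                    # p = number of coefficients
print(which(hatvalues(fit) > 2 * p / n))  # high-leverage points
print(which(abs(rstudent(fit)) > 3))      # large studentized residuals
```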

# 4
> From the data available at [http://www.stat.ufl.edu/~winner/datasets.html](http://www.stat.ufl.edu/~winner/datasets.html), obtain the data on “Fibre Diameters and Breaking Strenghs for Nextel 610 Fibres.” (please note that there is a typo on the website. It should be Strength and not Strengh) According to the description available there, the expectation is that the log of breaking strength of the fibre should be negatively and linearly related to diameter. (Note log here means natural log if not specified.)

```{r}
dt <- fread("fibre.csv")
colnames(dt) <- c("stren", "diam")
head(dt)
```


## 4a
> Produce a scatter plot of breaking strength against diameter.

```{r}
plot(dt$diam, dt$stren)
```

## 4b
> Produce a scatter plot of the log of breaking strength against diameter. 

```{r}
dt$logStren <- log(dt$stren)
plot(dt$diam, dt$logStren)
```

## 4c 
> Produce a scatter plot of the log of breaking strength against the log of diameter.

```{r}
dt$logDiam <- log(dt$diam)
plot(dt$logDiam, dt$logStren)
```

## 4d 
> Regress breaking strength on diameter. 

```{r}
lm <- lm(stren ~ diam, data = dt)
lm
summary(lm)
```

## 4e
> Regress the log of breaking strength on diameter. 

```{r}
lm <- lm(logStren ~ diam, data = dt)
lm
summary(lm)
```


## 4f 
> Regress the log of breaking strength on the log of diameter. 

```{r}
lm <- lm(logStren ~ logDiam, data = dt)
lm
summary(lm)
```

# 5
> From the data available at [http://www.stat.ufl.edu/~winner/datasets.html](http://www.stat.ufl.edu/~winner/datasets.html), obtain the data on “Variables associated with Permeability and Porosity of Rocks”

```{r}
dt <- fread("rocks.csv")
colnames(dt) <- c("type", "density", "porosity", "logPermeability", "residue", "carbonate", "aGrain", "aSD", "bGrain", "bSD", "calcite", "dolomite")
head(dt)
```


## 5a
> Fit a multiple regression model to predict porosity. Please provide a clean model with only significant variables. Interpret your model results and diagnostics.

```{r}
lm <- lm(porosity ~ ., data = dt)
lmSumm <- summary(lm)
coefs <- coef(lmSumm)[-1,4] <= .05

# This prunes the regression until only significant variables remain
while (length(coefs) != sum(coefs)) {
  # drop every predictor with p > .05, rebuild the formula, and refit
  form <- paste("porosity ~", paste(names(coefs)[coefs], collapse = " + "))
  lm <- lm(formula = form, data = dt)
  lmSumm <- summary(lm)
  coefs <- coef(lmSumm)[-1,4] <= .05
}
lm
lmSumm
par(mfrow = c(2,2))
plot(lm)
```
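
The hand-rolled loop drops every insignificant predictor at once; base R's `step()` performs a related backward search using AIC instead of p-values. A sketch on synthetic data where only `x1` truly matters:

```{r}
set.seed(7)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)     # x2 and x3 are pure noise

full    <- lm(y ~ ., data = d)
reduced <- step(full, trace = 0)   # backward elimination by AIC
print(names(coef(reduced)))
```

AIC-based selection can retain borderline predictors that a strict p < .05 rule would drop, so the two approaches need not agree exactly.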


## 5b
> Fit a multiple regression model to predict log(permeability). Please provide a clean model with only significant variables. Interpret your model results and diagnostics.  

```{r}
lm <- lm(logPermeability ~ ., data = dt)
lmSumm <- summary(lm)
coefs <- coef(lmSumm)[-1,4] <= .05

# This prunes the regression until only significant variables remain
while (length(coefs) != sum(coefs)) {
  # drop every predictor with p > .05, rebuild the formula, and refit
  form <- paste("logPermeability ~", paste(names(coefs)[coefs], collapse = " + "))
  lm <- lm(formula = form, data = dt)
  lmSumm <- summary(lm)
  coefs <- coef(lmSumm)[-1,4] <= .05
}
lm
lmSumm
par(mfrow = c(2,2))
plot(lm)
```

